(54) Cache coherence network for a multiprocessor data processing system



(57) A cache coherence network for transferring 
coherence messages between processor caches in a 
multiprocessor data processing system is provided. The 
network includes a plurality of processor caches associ- 
ated with a plurality of processors, and a binary logic tree 
circuit which can separately adapt each branch of the 
tree from a broadcast configuration during low levels of 
coherence traffic to a ring configuration during high lev- 
els of coherence traffic. A cache snoop-in input receives 
coherence messages and a snoop-out output outputs, 
at the most, one coherence message per current cycle
of the network timing. A forward signal on a forward
output indicates that the associated cache is outputting a
message on snoop-out during the current cycle. A cache
outputs received messages in a queue on the snoop-out
output, after determining any response message based
on the received message. The binary logic tree circuit
has a plurality of binary nodes connected in a binary tree
structure. Each branch node has a snoop-in, a snoop-out,
and a forward connected to each of a next higher
level node and two lower level nodes. A forward signal
on a forward output indicates that the associated node
is outputting a message on snoop-out to the higher node
during the current cycle. Each branch ends with multiple
connections to a cache at the cache's snoop-in input,
snoop-out output and forward output.
Description

The present invention relates in general to cache 
coherence networks for multiprocessor data processing 
systems. 

A cache coherence network connects a plurality of 
caches to provide the transmission of coherence mes- 
sages between the caches, which allows the caches to 
maintain memory coherence. A snoopy cache coherence
mechanism is widely used and well understood in
multiprocessor systems. Snoopy cache coherence in
multiprocessor systems uses a single bus as a
data transmission medium. The single bus allows messages
and data to be broadcast to all caches on the bus
at the same time. A cache monitors (snoops on) the bus 
and automatically invalidates data it holds when the 
address of a write operation seen on the bus matches 
the address the cache holds. 

A single bus cache coherence network becomes
impractical in medium-to-large multiprocessor systems.
As the number of processors in the system increases, a
significant load is placed on the bus to drive the larger
capacity, and the volume of traffic on the bus is
substantially increased. Consequently, the cycle time of the
snoopy bus scales linearly with the number of caches
attached to the bus. At some point, the cycle time of the
snoopy bus will become larger than the cycle time of the
processors themselves, resulting in a saturation of the bus.
Combining this with the fixed throughput of one coherence
message per cycle of the bus, the bus quickly saturates
as the number of caches attached to the bus
increases. Thus, there is a limit to the number of caches
that can be maintained effectively on a single snoopy
bus. What is needed is an interconnection network that
can adapt under the heavy electrical loading and
increased traffic conditions that may result in a large
multiprocessor system, thus providing scalability to the
system. It would be further desirable to provide an
interconnection network that acts logically like, and 
affords a broadcast capability like, the snoopy bus. 

It is the object of the present invention to provide an 
adaptive, scalable cache coherence network for a data 
processing system which acts like a snoopy bus and 
which provides broadcast capability. 

The foregoing objects are achieved as is now
described. According to the present invention as
claimed, a cache coherence network for transferring
coherence messages between processor caches in a
multiprocessor data processing system is provided. The
network includes a plurality of processor caches
associated with a plurality of processors, and a binary logic
tree circuit which can separately adapt each branch of the
tree from a broadcast configuration during low levels of
coherence traffic to a ring configuration during high
levels of coherence traffic.

In at least a preferred embodiment, each cache has
a snoop-in input, a snoop-out output, and a forward
output, wherein the snoop-in input receives coherence
messages and the snoop-out output outputs, at the most, one
coherence message per current cycle of the network
timing. A forward signal on a forward output indicates that
the associated cache is outputting a message on the
snoop-out during the current cycle. A cache generates
coherence messages according to a coherency protocol,
and, further, each cache stores messages received on
the snoop-in input in a message queue and outputs
messages loaded in the queue on the snoop-out output, after
determining any response message based on the
received message.

The binary logic tree circuit has a plurality of binary
nodes connected in a binary tree structure, starting at a
top root node and having multiple branches formed of
branch nodes positioned at multiple levels of a branch.
Each branch node has a snoop-in, a snoop-out, and a
forward output connected to each of a next higher level
node and two lower level nodes, such that a branch node
is connected to a higher node at a next higher level of
the tree structure, and to a first lower node and second
lower node at a next lower level of the tree structure. A
forward signal on a forward output indicates that the
associated node is outputting a message on snoop-out
to the higher node during the current cycle. Each branch
ends with multiple connections to a cache at the cache's
snoop-in input, snoop-out output, and forward output,
wherein the cache forms a bottom level node.

The invention will best be understood by reference
to the following detailed description of an illustrative
embodiment when read in conjunction with the
accompanying drawings, wherein:

Figure 1 depicts a block diagram of a cache
coherence network;

Figure 2 shows a schematic diagram of a preferred
embodiment of a cache coherence network;

Figure 3 shows a schematic diagram of the logic
circuit of a preferred embodiment of a network node;

Figures 4 - 7 are the four possible port connection
configurations of the logic circuit of Figure 3, as it is
used in the embodiment of Figure 2;

Figure 8 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a first example;

Figure 9 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a second
example;

Figure 10 shows the connections and message
transmission flow during a cycle of the cache
coherence network, under conditions of a third example;

Figure 11 shows a schematic diagram of a logic
circuit of a preferred embodiment of a network node.
With reference now to the figures and in particular
with reference to Figure 1, there is depicted a block
diagram of a cache coherence network. Network logic tree
10 is connected to a plurality of processor/caches
P0 - Pn-1. Each processor/cache Pj (Pn-1 ≥ Pj ≥ P0)
represents a processor with an associated cache, although the
physical implementation may not have the cache integral
to the processor as shown by the blocks in Figure 1. The
processor caches are also connected through a separate
data communications bus (not shown) for transferring
data blocks of memory between the processors and
the system's main memory.

As seen in Figure 1, each processor P0 - Pn-1 has
three connections to the network: snoop-out (SO),
forward (F), and snoop-in (SI). The F signal output from a
processor is a single bit signal. The SO and SI signals are
multi-bit signals carried over a multi-bit bus. The
information flowing over the network from the SO and SI ports
is referred to as coherence traffic and can be divided into
two categories: coherence requests and coherence
responses. The requests and responses are in the form
of packetized messages which travel in the network as
a single uninterrupted unit. Coherence requests are
initiated by a cache in response to a main memory access
by its processor. A coherence response typically is
initiated by other caches responding to requests which they
have received on their SI inputs. An example of a
coherence request would be a message asking a cache to
invalidate a block of data. For example, (tag id)
DCache-block-flush. An example of a coherence response would
be an acknowledge message indicating the data-block
has been invalidated in the cache. For example, Ack,
(tag id). The coherence messages used in the cache
coherence network of the present invention could take on
many forms, including those well known and often used
in current snoopy coherency schemes.

The SO output is used for outputting a number of 
messages onto the network. The network is timed, so 
that a cache may output only one message during each 
cycle of the network timing. The cache may issue a new 
coherence request, or it may respond to a coherence 
request by generating a response, or it may simply pass 
on a request that it had received earlier over its SI port. 
When a cache uses its SO port to output a coherence
message, it requests participation in the coherence
traffic over the network by negating its F signal. When a
cache is not requesting participation in the coherence
traffic, it always asserts its F signal and outputs a
negated signal on the SO port (i.e., SO = 0).

A cache always receives coherence requests or
responses from other caches on its SI input. A cache
deletes a request it receives from the coherence traffic
on the SI port, if it is one it had sent out earlier over the
SO port to be issued to the other processors in the
network. Suitable identification fields are placed within each
coherence message when it is sent out from an SO port,
thus enabling a receiving cache to identify the originating
cache of the message. In this way, a cache is able to
identify its own messages which it had sent out over the
network at a previous cycle, and to delete the message.
This message will be deleted regardless of whether the
F signal is asserted at the time of receipt.

A cache maintains a queue of incoming requests on
its SI port. This queue (not shown) is necessary because
over a given period of time the cache may be generating
its own coherence messages faster than it can evaluate
and/or rebroadcast the received messages. The cache
will delete a message from the SI queue if the message's
identification field shows it to be a message originating
from that cache.

In any cache coherence protocol which might be
used with the preferred embodiment, the cache
generates a response message if a received message is
relevant to its own contents and warrants a response. In
addition, the cache may either forward a received
request out onto the network over its SO port or ignore it.

In accordance with the present invention, if the
cache had asserted the F signal when it received a
particular coherence request, the next processor in the
network must also have received that request (as explained
below). In that case, there is no need for the cache to
forward the message to the next cache in the network. If
the cache had negated the F signal at the time it received
the coherence request, and therefore had itself sourced
a valid coherence message to its SO port
simultaneously, the cache had clipped the broadcast mechanism
(as explained below) and must forward the received
coherence request to the next cache in the network.
What constitutes the "next" cache in the network may be
logically different than the physical makeup of the
computer system. The "next" cache or processor is
deciphered from the logic of the network logic tree 10, which
is made up of the network nodes. In the preferred
embodiment as shown in Figure 2, it will be shown that,
because of the logic circuitry, a "next" processor is the
processor to the left of a given processor, and is labelled
with a higher reference number (i.e., P1 > P0). But
because of the network connection at the root node of
the tree, P0 is the "next" processor after processor P7.

Along with saving the incoming message in the SI
queue, the receiving cache saves the current state of the
F signal at the time it receives the queued message.
Preferably, the F signal is saved with the message in the
SI queue. To determine whether to forward a received
message out onto the network, the cache will check the
state of the F signal at the time that the coherence
message was received, which was stored in the message
queue at the same time as the message.
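
The queueing and forwarding rule described above can be sketched
in Python as follows. This is only an illustrative sketch under
stated assumptions, not the circuitry of the embodiment: the names
Message, Cache, origin, receive and next_to_forward are
hypothetical, and the coherency protocol's own evaluation of each
message (and any response it generates) is omitted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    origin: int   # hypothetical identification field naming the sending cache
    payload: str  # e.g. a request or an acknowledge


@dataclass
class Cache:
    cache_id: int
    si_queue: deque = field(default_factory=deque)

    def receive(self, msg, f_asserted):
        """Handle a message arriving on SI during the current cycle.

        f_asserted is the state of this cache's own F signal in that cycle.
        """
        if msg.origin == self.cache_id:
            return  # own message: deleted regardless of the F signal
        # Save the F state together with the message in the SI queue.
        self.si_queue.append((msg, f_asserted))

    def next_to_forward(self):
        """Return the next queued message that must be passed on, or None."""
        while self.si_queue:
            msg, f_was_asserted = self.si_queue.popleft()
            # A real cache would evaluate msg against its contents here.
            if not f_was_asserted:
                # This cache was transmitting and clipped the broadcast,
                # so the message has to go on to the "next" cache.
                return msg
            # F was asserted: the next cache already received the message.
        return None
```

Storing the F state next to each queued message keeps the
forwarding decision local to the cache, which matches the
preference stated above for saving the F signal in the SI queue.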

Referring now to Figure 2, there is depicted a
preferred embodiment of an adaptable, scalable binary tree
cache coherence network in a multiprocessor data
processing system, according to the present invention.
The network is comprised of eight processors and their
associated caches, P0 - P7, and the network nodes,
NODE1-7. Together they form a network by which the
processors P0 - P7 can efficiently pass coherence
messages to maintain coherent memory within their caches.
This network is able to adapt to varying volumes and
kinds of coherence messages being transmitted over the
network. The binary tree structure of the transmission
network has a cycle time which scales to the logarithm
of the number of caches (i.e., processors) connected to
the network. This enables the network of the present
invention to be scalable to medium-sized to large-sized
multiprocessor systems. When there is light traffic on the
network, processors are able to broadcast coherence
messages to other processors, providing a quick and
efficient cache coherence mechanism. As coherence traffic
increases, the network is able to adapt and pass
messages in a ring-like manner to the next processor in the
network. In that configuration, the network bandwidth is
increased by allowing pipelining of coherence traffic. In
fact, the throughput of coherence messages through the
network can be as high as the number of caches in the
network. Also, the ring connections substantially reduce
driving requirements. Moreover, the network is also able
to adapt to varying degrees of increased traffic by
segmenting itself into broadcast sections and ring sections,
depending on the locality of increased traffic.
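
As a small worked illustration of the logarithmic scaling, the
number of node levels and the number of nodes can be computed for
a given number of caches. The function name tree_shape is
hypothetical, and the sketch assumes a full binary tree with a
power-of-two number of caches, as in the eight-processor
embodiment of Figure 2.

```python
def tree_shape(num_caches):
    """Return (levels, nodes) of a full binary tree over num_caches leaves.

    Assumption of this sketch: num_caches is a power of two (>= 2).
    """
    if num_caches < 2 or num_caches & (num_caches - 1):
        raise ValueError("sketch assumes a power-of-two number of caches")
    levels = num_caches.bit_length() - 1  # log2(num_caches)
    nodes = num_caches - 1                # every internal node has two children
    return levels, nodes


# Eight caches give three levels of nodes and seven nodes (NODE1-7).
print(tree_shape(8))
```

Doubling the number of caches adds only one level of nodes to
the path a message travels, whereas a single bus must drive every
added cache directly.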

The network logic tree 10 (in Figure 1) is comprised
of a plurality of network nodes connected together in a
binary logic tree structure, and each of the processors
of the multiprocessor system is connected at the leaves
of the binary logic tree. In the preferred embodiment of
Figure 2, the network logic tree comprises root node
NODE1 at the top level of the tree and branch nodes
NODE2-7 formed along branches at lower levels of the
tree.

Each network node NODE1-7 is designed with an
identical logic circuit, which is depicted in Figure 3,
according to a preferred embodiment of the present
invention. This circuit is the same circuit used in carry
look-ahead adder circuits. Therefore, the operation of
this circuit is well understood and well known by those
skilled in the art. The organization and operation of a
binary logic tree using the carry look-ahead circuit as the
universal link has been described in the prior art. See
G.J. Lipovski, "An Organization For Optical Linkages
Between Integrated Circuits", NCC 1977, which is
incorporated herein by reference. This paper describes the
use of a carry look-ahead circuit in a binary logic tree
to configure a broadcast or propagating link optical
communication network.

Network node 100 has three connections to a higher
level node in the tree: SO, F, and SI; and six connections
to two lower level nodes in the tree: SO0, F0, and SI0
connected to a first lower level node, and SO1, F1, and
SI1 connected to a second lower level node. Each SO
and SI port is labelled with a w to indicate that the port
accommodates w-bit-wide signals. Each of the F ports
accommodates a 1-bit-wide signal.

The SI port has an arrow pointing into the node 100
to show that the node receives messages from the higher
level node on that port. The SO and F ports have arrows
pointing away from the node showing that these are
output signals from the node to a higher level node in the
binary tree. Similarly, the SI0 and SI1 ports have arrows
pointing away from node 100 showing that they are outputs
from node 100 and inputs (snoop-in) into their respective
lower level nodes. Ports F0, SO0, F1, and SO1 are shown
with arrows pointing into node 100 to indicate that they
are outputs from the lower level nodes and inputs into
node 100.

The circuit of Figure 3 is combinational, and has no
registers within it. The logic of the tree works as
stipulated when all signals are valid and stable. However, the
processors and caches which use the tree are
independently clocked circuits. In some system designs, it may
therefore be necessary to provide queues at the ports of
the tree and design an appropriate handshaking
mechanism for communication between a cache and its tree
ports. The tree is clocked independently and works on
the entries in front of the SO and F queues at its leaf
ends. (In fact, a separate F queue is not necessary, if an
empty SO queue implies an asserted F signal.) The tree
forwards the data to caches over the SI ports.
Additionally, if delays through the tree are not acceptable (for the
required cycle time of the tree), the tree can be pipelined
by adding registers at appropriate levels of the tree.

It should be noted that although the circuit of Figure
3 simply and efficiently provides the transmission
connections required for the present invention, it will be
appreciated by those skilled in the art that other circuit
configurations which provide the same input and output
connections to provide the same logical function could
also be used in the present invention. For example,
Figure 11 is a schematic diagram of a logic circuit which
may be used as a network node in an alternative
embodiment of the present invention. Also, the logic of the
forward signals or the snoop-in/snoop-out signals could be
inverted and the binary logic tree circuitry designed to
operate on these inverted signals as will be appreciated
by those skilled in the art.

The operation of the circuit in Figure 3 is predicated
on the states of the forward signals F0 and F1. Therefore,
there are four possible configurations under which the
logic circuit operates. These four configurations are
shown in Figures 4-7.

Figure 4 diagrams the connections between ports
in node 100 when both forward signals from the lower
level nodes are not asserted (i.e., F0 = F1 = 0). Because
both nodes have negated their forward signals, the lower
level nodes will be outputting coherence messages over
their SO ports. SO0 will be transmitted to SI1 through
logical OR-gate OR1. The negated forward signals will turn
off AND-gates AND1, AND2 and AND3. This allows SO1
to pass through OR2 to SO. SI is directly connected to
SI0.

The second configuration of Figure 3 will produce a
connection of ports in node 100 as diagrammed in Figure
5. In the second configuration, node0 (the node
connected to the right branch of node 100 and not shown)
is not transmitting (i.e., it is forwarding) a coherence
message to its next higher level node, in this case node 100.
Therefore, node0 has asserted its forward signal F0. The
other node connected to node 100, node1, is transmitting
a message to the next higher node, node 100, and thus
has negated its forward signal F1. With F1 = 0, AND3
outputs F = 0. The asserted F0 allows SI to transmit
through AND1 into OR1. Because, by definition with F0
asserted, SO0 is not outputting any messages, only the
output of AND1 is output at port SI1. Again, with F1 = 0,
AND2 is closed and SO1 passes through OR2 to SO.

Referring now to Figure 6, there is diagrammed a third
configuration of the logic circuit of Figure 3. In this
situation node0 is transmitting a message over the network
and node1 is not: F0 = 0 and F1 = 1. F0 closes AND3 to
produce F = 0. Once again SI is directly connected to
SI0. Because F0 is negated, node0 is transmitting messages
over SO0, which is directly connected to SI1 through
OR1. The negated F0 closes AND1 as an input into OR1.
The asserted F1 allows SO0 to pass through AND2 into
OR2. By definition, an asserted F1 indicates that no
messages are output on SO1, and therefore, the output of
AND2 passes through OR2 to SO.

The fourth possible configuration of the logic circuit
of Figure 3 occurs when neither of the lower level nodes
is transmitting messages to node 100. A diagram of the
transmission connections for this configuration is shown
in Figure 7. Here, F0 = F1 = 1. These inputs generate F
= 1 from AND3. SI is directly connected to SI0. F0 is
asserted, allowing SI to pass through AND1 and OR1 to
SI1. Node0 is not transmitting, so SO0 does not pass
through OR1 to SI1. Although SO1 is connected through
OR2 to SO, and SO0 is connected through AND2 and
OR2 to SO, those connections are not shown to simplify
the diagram of Figure 7 since neither node is
transmitting any messages over its snoop-out port.
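
The four configurations above all follow from one set of gate
equations. The following Python sketch models a single node as a
pure function; it is an illustration, not the circuit of Figure 3
itself. It assumes a w-bit message is represented as a non-negative
integer with 0 meaning an idle SO port, and that a 1-bit F signal
gating a w-bit bus simply passes or blocks the whole word. The
function name node is hypothetical.

```python
def node(f0, so0, f1, so1, si):
    """Combinational model of one tree node.

    f0, so0: forward and snoop-out from the first (right) lower node.
    f1, so1: forward and snoop-out from the second (left) lower node.
    si:      snoop-in from the higher node.
    Returns (f, so, si0, si1).
    """
    si0 = si                       # SI is wired straight to SI0
    si1 = so0 | (si if f0 else 0)  # OR1(SO0, AND1(F0, SI))
    so = so1 | (so0 if f1 else 0)  # OR2(SO1, AND2(F1, SO0))
    f = f0 & f1                    # AND3(F0, F1)
    return f, so, si0, si1


# Figure 4 case: both lower nodes transmit (F0 = F1 = 0).
print(node(0, 0xA, 0, 0xB, 0x5))
```

Because an asserted forward signal always accompanies an idle
SO port, each OR gate sees at most one non-zero input, so the
bitwise OR never merges two messages.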

Referring again back to Figure 2, root node NODE1
is the top level node of the binary logic tree. The SO of
NODE1 is directly connected to the SI of NODE1. The
two branches of the binary logic tree extending down
from the root node to the next level nodes NODE2,
NODE3 are comprised of three busses for delivering
signals. As can be seen from Figure 2, the connections of
NODE1 to NODE2 are equivalent to the connections
from node 100 to node0, as described with Figure 3, and
the connections of NODE1 to NODE3 are equivalent to
the connections of node 100 to node1, as described with
Figure 3.

From each node NODE2, NODE3, the binary tree
again branches into two connections to the lower level
nodes. Each of the higher level connections from
NODE4-NODE7 is connected to its associated next
higher level node's lower level connections. The branch
nodes NODE4-NODE7 in turn have two branch
connections to the next lower level nodes, in this case, those
nodes being the processors/caches P0 - P7. Each
processor P0 - P7 has its SO, F, and SI connected to the
lower level connections of the next higher level node
(i.e., NODE4-NODE7).

For three examples of how the cache coherence
network of the present invention adapts to coherence traffic
on the network, consider Figures 8-10. For the first
example, consider the extreme case where every cache
on the network is attempting to transmit a coherence
message onto the network. In this extreme case, every
cache must receive every other cache's message and
potentially might respond with another message for each
received message. Such a scenario forces the cache
coherence network into a ring-type network where each
cache passes a received message on to the next cache
in the network after determining any response of its own
to the message.

In the example of Figure 8, it can be seen that all
caches are negating their forwarding signals (F = 0), so
that they may transmit a coherence message out onto
the network. Consequently, NODE4 - 7 will have negated
forward inputs from the lower level nodes. Thus, the logic
circuit of each node will create transmission connections
equal to those shown in Figure 4, as shown in Figure 8.
As can be seen from Figure 4, NODE4 - 7 will also
negate their forward signals, resulting in NODE2 and
NODE3 being configured as Figure 4. Last, NODE1 also
has two negated forward signal inputs, configuring
NODE1 as Figure 4.

The dashed arrows shown in Figure 8 indicate data
flow within the network. As can be seen, with every cache
in the network outputting a message on the network
during this current cycle, each cache only transmits to the
next cache in the network. For example, P0 outputs its
coherence message on its snoop-out (SO). This arrives
at NODE4 on its SO0 port which is connected to its SI1
port, which delivers P0's coherence message to P1 at
its SI port. P1 outputs its message on its SO port which
arrives at SO1 of NODE4. This is transmitted to the SO
port of NODE4, and on to the node at the next higher
level of the tree. In this case, the next higher node from
NODE4 is NODE2. Here, the message arrives on the
right branch leading to NODE2. NODE2 is configured to
transfer this message back down the left branch to
NODE5. In turn, NODE5 connects its SI port to the SI
port of the right branch node at the next lower level from
NODE5, in this case, that node being P2. It can be seen
then, that the coherence message output from P1 is
transmitted through NODE4, up to NODE2, back down
to NODE5, and then arrives at P2.

By inspecting the transmission paths of the
remainder of the processors, it can be seen that each processor
passes its coherence message on to only the next
processor in the network. Because that next processor is also
transmitting a message onto the network, the message
from the previous processor is necessarily clipped and
is not sent on to any other processors in the system. This
can be understood, with reference to Figure 8, by
noticing that data moves in one direction within the network.
Because of the particular logic circuit used in the
preferred embodiment, data generally travels from the
right-hand side of the network to the left-hand side before
passing over the top of the tree to transmit to the
remainder of the tree on the right-hand side. Thus, in the
preferred embodiment, the next processor in the network is
the next processor to the left.
In Figure 8, the network has formed a ring-network.
In this network, each processor passes a network
message on to the next processor. The caches continue to
pass the message along to the next cache in the ring
with each cycle of the network until every cache in the
network has received the message.

Referring now to Figure 9, there is depicted a
diagram of the data flow for a second example within a
preferred embodiment of the cache coherence network of
the present invention. In this extreme example, only one
processor in the network is attempting to transmit a
coherence message over the network during the current
cycle. Because no other messages are being sent over
the network during the current cycle, the one processor
transmitting over the network is able to broadcast its
message to every other processor within this one cycle.
Here, P1 is transmitting a message, and, therefore, has
negated its forward signal. All other caches, having not
transmitted a message, have asserted their forward
signals (P0 and P2 - P7 have F = 1). Therefore, NODEs 5 - 7
are configured as shown in Figure 7. Each of these nodes
asserts its forward signal. This results in NODE3 being
configured as shown in Figure 7. NODE4 receives a
negated forward signal from its left branch and an
asserted forward signal from its right branch, coming
from the processor nodes P1 and P0, respectively. This
places NODE4 in the configuration of Figure 5. NODE2
receives an asserted forward signal from NODE5 and a
negated forward signal from NODE4, configuring it as
shown in Figure 6. Similarly, NODE1 receives an
asserted forward signal from NODE3 on its left branch,
and a negated forward signal from NODE2 on its right
branch, resulting in a configuration as shown in Figure 6.

Given this structure of the network connections
during the current cycle, the dashed arrows in Figure 9
describe the direction of coherence message
transmission from processor P1 to the rest of the processors
connected to the cache coherence network. The message
output from P1's SO port passes through NODE4 up into
NODE2, where the message is transferred both back
down the left branch from NODE2 to NODE5, and up
through NODE2 and up along the right branch of
NODE1. The message wrapping around from P1
through NODE5 is then transferred back down both the
left and right branches of NODE5 to processors P2 and
P3. The message also is transmitted through SO of
NODE2 along the right branch of NODE1. This message
is transferred back down the left branch of NODE1 to be
broadcast back down the entire left-hand side of the
binary logic tree so that P4 - P7 receive the message.
The message is also transmitted up through the SO port
of NODE1, which wraps back down through the
right-hand branch of NODE1 into NODE2, and again down
the right-hand branch of NODE2 into NODE4, where the
message is passed down both branches of NODE4 into
P0 and P1.

As can be seen from the above description of Figure
9, the cache coherence network of the present invention
was able to adapt itself to a broadcast network so that a
single processor was able to broadcast the message to
the entire network within one cycle of the cache
coherence system. The message spreads out along the
branches of the tree to all processors to the left of the
broadcasting processor that are within the broadcaster's
half of the binary logic tree. When the broadcasted
message reaches the root node, NODE1, the message is
passed back down along the right-hand side of the
broadcasting processor's half of the tree so that all
processors to the right of the broadcasting processor in its
half of the tree receive the message. At the same time,
the message is broadcast down from the root node to all
processors in the entire other half of the binary logic tree.
In the broadcast mode, the broadcasting processor will
also receive its own message. As has been explained, the
received message will contain an identification field
which indicates to the broadcasting cache that the
received message was its own, and thus, should be
ignored.
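
The ring and broadcast extremes of Figures 8 and 9 can be
reproduced with a short one-cycle simulation of the whole tree.
The following Python sketch is illustrative only and rests on
stated assumptions: messages are non-zero integers, an idle
processor presents (F, SO) = (1, 0), the processors are listed
from P0 (rightmost) to P7 (leftmost), and the names up, down and
cycle are hypothetical.

```python
def up(leaves):
    """Return (F, SO) presented upward by the subtree over leaves."""
    if len(leaves) == 1:
        return leaves[0]
    half = len(leaves) // 2
    f0, so0 = up(leaves[:half])  # right (lower-numbered) branch
    f1, so1 = up(leaves[half:])  # left (higher-numbered) branch
    return f0 & f1, so1 | (so0 if f1 else 0)


def down(leaves, si):
    """Return the SI value delivered to each leaf, given the subtree's SI."""
    if len(leaves) == 1:
        return [si]
    half = len(leaves) // 2
    f0, so0 = up(leaves[:half])
    si0 = si                       # SI goes straight to the right branch
    si1 = so0 | (si if f0 else 0)  # left branch sees SO0, or SI if F0 = 1
    return down(leaves[:half], si0) + down(leaves[half:], si1)


def cycle(leaves):
    """One network cycle: the root node's SO is wired to its own SI."""
    _, root_so = up(leaves)
    return down(leaves, root_so)


IDLE = (1, 0)  # F asserted, SO = 0

# Figure 8 case: every processor transmits, so Pi hears only its right
# neighbour, and P0 hears P7 over the top of the tree.
ring = cycle([(0, 0x10 + i) for i in range(8)])

# Figure 9 case: only P1 transmits, so every processor (P1 included)
# receives P1's message in the same cycle.
broadcast = cycle([IDLE, (0, 0x2A)] + [IDLE] * 6)

print([hex(m) for m in ring])
print([hex(m) for m in broadcast])
```

The two-pass evaluation works because F and SO depend only on
signals coming from below, so they can be settled before the SI
values are propagated back down.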

Referring now to Figure 10, there is depicted a third
example of the connections and data transmission in a
preferred embodiment of the cache coherence network
of the present invention during a particular cycle of the
network. This example shows how the present invention
can adapt to provide a combination of the ring and
broadcast networks under conditions between the two
extremes described in the examples of Figure 8 and
Figure 9.

In this example, for the current cycle, processors P1,
P2, P4, and P5 are transmitting coherence messages
onto the network, as is indicated by their negated forward
signals. Processors P0, P3, P6, and P7 are not
transmitting onto the network during the current cycle, as is
indicated by their asserted forward signals.

A 0-1 forward signal input into NODE4 configures it as
Figure 5. A 1-0 forward signal input into NODE5 configures
it as Figure 6. A 0-0 forward signal input into NODE6
configures it as Figure 4. A 1-1 forward signal input into
NODE7 configures it as Figure 7. The forward signals from
both NODE4 and NODE5 are negated, configuring NODE2 as seen
in Figure 4. The forward signal of NODE6 is negated and the
forward signal of NODE7 is asserted, configuring NODE3 as
shown in Figure 6. The forward signals of NODE2 and NODE3
are both negated, configuring NODE1 as seen in Figure 4.
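
The node configurations in this example can be reproduced
mechanically. The following Python sketch is illustrative
only and is not part of the described circuit. It assumes
that NODEn has NODE2n and NODE2n+1 as its lower nodes
(consistent with the numbering used above), that each pair
is written with the higher-numbered lower node first, and
that a node asserts its own forward signal only when both
of its lower forward signals are asserted. The names
FIGURE_FOR_PAIR and configure_tree are hypothetical.

```python
# Forward-signal pair (as written in the text) -> figure number
# of the resulting node configuration.
FIGURE_FOR_PAIR = {(0, 0): 4, (0, 1): 5, (1, 0): 6, (1, 1): 7}


def configure_tree(forward):
    """Map each node number (1..7) to the figure showing its configuration.

    forward[i] is the forward signal of processor Pi:
    1 = asserted (not transmitting), 0 = negated (transmitting).
    """
    if len(forward) != 8:
        raise ValueError("this sketch models an 8-processor tree")
    # Assumed heap-style numbering: slots 8..15 stand for P0..P7.
    signal = {8 + i: bit for i, bit in enumerate(forward)}
    figures = {}
    for node in range(7, 0, -1):
        low = signal[2 * node]       # lower-numbered child
        high = signal[2 * node + 1]  # higher-numbered child
        figures[node] = FIGURE_FOR_PAIR[(high, low)]
        # Assumption: the node forwards nothing upward only when
        # neither child is transmitting.
        signal[node] = low & high
    return figures


# P1, P2, P4 and P5 transmit; P0, P3, P6 and P7 do not.
example = configure_tree([1, 0, 0, 1, 0, 0, 1, 1])
```

For the forward signals of this example, the sketch gives
Figure 5 for NODE4, Figure 6 for NODE5, Figure 4 for NODE6,
Figure 7 for NODE7, Figure 4 for NODE2, Figure 6 for NODE3,
and Figure 4 for NODE1, matching the description above.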

Processor P1's message will pass through NODE4 up the right
branch of NODE2, down the left branch into NODE5, and down
the right branch of NODE5 into processor P2. Because
processor P2 was also broadcasting a message during this
cycle, NODE5 could not be configured to allow both P1's and
P2's messages to be transferred down the left branch of
NODE5 into P3. Thus, P1's message is clipped at P2, and P2
must maintain P1's coherence message in its SI queue to be
retransmitted to the rest of the network on the next or a
succeeding cycle. Also, P1's message was not transmitted up
through SO of NODE2 because another processor to the left
of P1 at a leaf of the NODE2 branch was also transmitting,
and, therefore, was given the connection to the snoop-out
of NODE2.

As can be seen from Figure 10, P2's message passed back
down the left branch of NODE5 to P3 and up through SO of
NODE5, through to the SO of NODE2 to NODE1. At NODE1, the
message was transmitted back down the left branch of NODE1
to NODE3. The message from P2 passes down the right branch
of NODE3 and the right branch of NODE6 into P4. There the
message is clipped because of P4's transmission. P4's
transmission is also clipped by P5's transmission, and
therefore is passed only to NODE6 and then back down to P5.

P5's message passes up through NODE6 and then up along the
right branch of NODE3, where it is sent both back down the
left branch of NODE3, and up to the next higher level node,
NODE1. The message passing down the left branch of NODE3 is
broadcast through NODE7 into processors P6 and P7. The
message sent through the snoop-out of NODE3 routes back up
through NODE1, and down along the right branch of NODE1 and
NODE2 into NODE4. At NODE4, the message of P5 is broadcast
to both P0 and P1.

When using the network of the present embodiment there is a
danger that a cache initiating a coherence message might
assert its forward signal during a cycle that one of its
own messages currently pending on the network is delivered
back to it. The problem arises in that now the message has
been passed over to a subsequent processor once again,
which will reforward it throughout the network, and that
this could continue forever since the initiating cache may
continue to assert its forward signal.

To correct for this danger of a continuously forwarded
request, additional logic can be added to the network. At
the root level node in the network, the path that forwards
data from the left half of the tree to the right half of
the tree would have a decrementer, and so would the path
that goes in the opposite direction. Each coherence request
sent out over the network would contain an additional 2-bit
value that is decremented every time the message traverses
between the two half-trees. An additional bit in the
request carries a flag stating that the request is valid or
invalid. This flag bit is set to "true" by the initiating
cache, while the count value is set to 2. The flag is
turned to invalid when the count is already 0 when the
request reaches the root node. All invalid requests
received are to be discarded at once by a receiving cache.
If a unary coding of 2 is used, an easy implementation of
the decrementers is a mere inverter. The right to left
transfer negates one bit of the two bit count value, and
the left to right transfer negates the other. The logical
OR-ing of the 2 count bits as they come into the root node
generates the valid bit.
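
The inverter-based decrementer can be illustrated with a
short Python sketch. It is a software model under stated
assumptions, not the circuit itself: the class Request, the
function cross_root, and the direction labels "l2r" and
"r2l" are hypothetical names, and the valid flag is assumed
to stay invalid once it has been cleared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Loop-guard fields carried by a coherence request (hypothetical names)."""
    l2r_bit: int = 1   # count bit negated by the left-to-right path
    r2l_bit: int = 1   # count bit negated by the right-to-left path
    valid: bool = True  # set to true by the initiating cache


def cross_root(req, direction):
    """Model one traversal of the root node between the two half-trees."""
    # The valid bit is the OR of the two count bits as they arrive.
    # Assumption: an already-invalid request stays invalid.
    valid = req.valid and bool(req.l2r_bit | req.r2l_bit)
    if direction == "l2r":
        return Request(req.l2r_bit ^ 1, req.r2l_bit, valid)
    if direction == "r2l":
        return Request(req.l2r_bit, req.r2l_bit ^ 1, valid)
    raise ValueError("direction must be 'l2r' or 'r2l'")


first = cross_root(Request(), "l2r")   # count 2 -> 1, still valid
second = cross_root(first, "r2l")      # count 1 -> 0, still valid
third = cross_root(second, "l2r")      # count was already 0: invalid
```

Because a message must alternate between the two
directions, each inverter sees its bit set at most once, so
two crossings are allowed and the third arrives with both
count bits clear and is marked invalid.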

Another problem is that of detecting whether all caches
have responded or deciding that no more caches would
respond at a later time to a particular coherence message.
This problem arises because of the adaptability of the
cache coherence network. As coherence traffic increases,
the number of messages clipped increases, which necessarily
delays the transmission of requests and responses by
additional cycles. This problem is best solved by means of
a protocol and time-out mechanism that assumes an upper
bound on the delay that each cache may introduce in the
path of a message and of the corresponding response,
assuming that each is clipped at every cycle, and that
adding up these delays will produce an upper bound on the
time after which no responses may be expected by any caches
in the network.
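
The time-out calculation can be sketched as a simple sum.
The Python fragment below is only an illustration: the
function name response_timeout is hypothetical, the
per-cache delay bounds are assumed to be supplied by the
protocol designer, and the factor of two reflects the
assumption that, in the worst case, the request and its
response each pass through every cache once.

```python
def response_timeout(max_delay_per_cache):
    """Return a worst-case wait, in network cycles, for all responses.

    max_delay_per_cache holds, for each cache, an assumed upper bound
    on the cycles it can hold a clipped message before re-sending it.
    """
    if any(delay < 0 for delay in max_delay_per_cache):
        raise ValueError("delays must be non-negative")
    one_way = sum(max_delay_per_cache)  # request clipped at every cache
    return 2 * one_way                  # plus the matching response path


# Eight caches, each assumed to add at most three cycles.
timeout_cycles = response_timeout([3] * 8)
```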

Although the present invention has been described 
in a scheme based on a binary tree, the present invention 
can easily be generalized to any M-ary tree. It has been 
shown in the literature that a modified binary tree can be 
imbedded in a hypercube. See "Scalability Of A Binary
Tree On A Hypercube", S.R. Deshpande and P.M. Jenevein,
ICPP 1986, incorporated herein by reference. This
technique can be applied to achieving snoopy protocol 
in a hypercube based multiprocessor system. 

In summary, the cache coherence network of the 
present invention automatically adapts to the coherence 
traffic on the network to provide the most efficient trans- 
mission of coherence messages. The network adapts to 
a broadcast network or a ring network, or any combina- 
tion in between, as a function of which caches attached 
to the network are attempting to transmit coherence traffic
on the network. Thus, branches of the binary logic tree
with light coherence traffic may be predominately
configured in a broadcast configuration to allow coherence
messages to be quickly delivered to each cache
within that branch. Still other branches with heavy coher- 
ence traffic will automatically adapt to this increased traf- 
fic and configure themselves predominately in a ring 
network. 

While the invention has been particularly shown and 
described with reference to a preferred embodiment, it 
will be understood by those skilled in the art that various 
changes in form and detail may be made therein without 
departing from the scope of the invention. 

Claims 

1. A cache coherence network for transferring coherence
messages between processor caches in a multiprocessor data
processing system, the network comprising:

a plurality of processor caches associated 
with a plurality of processors, each cache having a 
snoop-in input, a snoop-out output, and a forward 
output, wherein the snoop-in input is arranged to 
receive coherence messages and the snoop-out is 
arranged to output, at the most, one coherence mes- 
sage per current cycle of the network timing, and 
arranged so that a forward signal on the forward output
indicates that the cache is outputting a message on the
snoop-out output during the current cycle, wherein a cache
is arranged to generate coherence messages according to a
coherency protocol, and, further, wherein each cache is
arranged to store
messages received on the snoop-in input in a mes- 
sage queue and to output messages loaded in the 
queue on the snoop-out output, after determining 
any response message based on the received mes- 
sage; and 

a binary logic tree circuit having a plurality of 
binary nodes connected in a binary tree structure, 
starting at a top root node and having multiple 
branches formed of branch nodes positioned at multiple
levels of a branch, and each branch node having a snoop-in,
a snoop-out, and a forward connected to each of a next
higher level node and two lower level nodes, such that a
branch node is connected to a higher node at a next higher
level of the tree structure, and to a first lower node and
second lower node at a next lower level of the tree
structure, and arranged so that a forward signal on a
forward indicates that the associated node is outputting a
message on snoop-out to the higher node during the current
cycle, and wherein each branch ends with multiple
connections to a cache at the cache's snoop-in input,
snoop-out output, and forward output, wherein the cache
forms a bottom level node.

2. A cache coherence network as claimed in Claim 1,
wherein a node is arranged to transmit a message
received on the snoop-in from the higher node to the 
snoop-in of the first lower level node, to transmit a 
message received on the snoop-out of the first lower 
level node to the snoop-in of the second lower level 
node, and to transmit a message received on the 
snoop-out of the second lower level node to the 
snoop-out going to the higher level node, when the 
first and second lower nodes are transmitting coher- 
ency messages during the current cycle. 

3. A cache coherence network as claimed in Claim 1 or
Claim 2, wherein a node is arranged to transmit a message
received on the snoop-in from the higher node to the
snoop-in of the first lower level node, and to transmit a
message received on the snoop-out of the first lower level
node to both the snoop-in of the second lower level node
and the snoop-out going to the higher level node, when the
first lower node is arranged to transmit a coherence
message and the second lower node is not transmitting a
coherency message during the current cycle.

4. A cache coherence network as claimed in any pre- 
ceding claim, wherein a node is arranged to transmit 
a message received on the snoop-in from the higher 
node to the snoop-in of the first lower level node and 
the snoop-in of the second lower level node, when 
the first and second lower nodes are not transmitting 
coherency messages during the current cycle. 

5. A cache coherence network as claimed in any preceding
claim, wherein a node is arranged to transmit a message
received on the snoop-in from the higher node to both the
snoop-in of the first lower level node and the snoop-in of
the second lower level node, and to transmit a message
received on the snoop-out of the second lower level node to
the snoop-out going to the higher level node, when the
first lower node is not transmitting a coherence message
and the second lower node is transmitting a coherency
message during the current cycle.

6. A cache coherence network as claimed in any preceding
claim, wherein the root node has the snoop-
out to the higher node connected to the snoop-in 
from the higher node. 

7. A cache coherence network as claimed in any preceding
claim, wherein a cache is arranged to assert a forward
signal on the forward output when the cache is not
transmitting a coherence message on the snoop-out output,
and to negate the forward signal on the forward output when
the cache is transmitting a coherence message during the
current cycle.

8. A cache coherence network as claimed in any preceding
claim, wherein all nodes of the binary logic tree circuit
are carry look-ahead circuits.